Learning Term-weighting Functions for Similarity Measures

Author

Wen-tau Yih

Abstract

Measuring the similarity between two texts is a fundamental problem in many NLP and IR applications. Among the existing approaches, the cosine measure of the term vectors representing the original texts has been widely used, where the score of each term is often determined by a TFIDF formula. Despite its simplicity, the quality of such a cosine similarity measure is usually domain dependent and decided by the choice of the term-weighting function. In this paper, we propose a novel framework that learns the term-weighting function. Given labeled pairs of texts as training data, the learning procedure tunes the model parameters by minimizing the specified loss function of the similarity score. Compared to traditional TFIDF term-weighting schemes, our approach shows a significant improvement on tasks such as judging the quality of query suggestions and filtering irrelevant ads for online advertising.
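
The TFIDF cosine baseline that the abstract refers to can be sketched roughly as follows. This is only an illustration of the conventional measure, not the paper's learned model: the helper names (`tfidf_weights`, `cosine`, `similarity`), the whitespace tokenization, and the particular `tf * log(N / df)` variant are all assumptions made for the example. The `weight_fn` parameter marks the place where a learned term-weighting function could be substituted.

```python
import math
from collections import Counter


def tfidf_weights(tokens, doc_freq, num_docs):
    """Weight each term by tf * log(N / df), one common TFIDF variant.

    doc_freq and num_docs are assumed corpus statistics supplied by the caller;
    terms missing from doc_freq are treated as having document frequency 1.
    """
    tf = Counter(tokens)
    return {t: c * math.log(num_docs / doc_freq.get(t, 1)) for t, c in tf.items()}


def cosine(u, v):
    """Cosine of two sparse term vectors stored as {term: weight} dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity(text_a, text_b, weight_fn):
    """Similarity of two texts under a pluggable term-weighting function."""
    tokens_a = text_a.lower().split()
    tokens_b = text_b.lower().split()
    return cosine(weight_fn(tokens_a), weight_fn(tokens_b))
```

Because only terms shared by both texts contribute to the dot product, the score depends entirely on how those terms are weighted, which is why the choice of weighting function matters so much for this measure.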

Similar resources

Evaluation of Similarity Measures for Template Matching

Image matching is a critical process in various photogrammetry, computer vision and remote sensing applications such as image registration, 3D model reconstruction, change detection, image fusion, pattern recognition, autonomous navigation, and digital elevation model (DEM) generation and orientation. The primary goal of the image matching process is to establish the correspondence between two ...

A Comparative Study of Ontology Based Term Similarity Measures on PubMed Document Clustering

Recent research shows that ontology as background knowledge can improve document clustering quality with its concept hierarchy knowledge. Previous studies take term semantic similarity as an important measure to incorporate domain knowledge into clustering process such as clustering initialization and term re-weighting. However, not many studies have been focused on how different types of term ...

Medical Document Clustering Using Ontology-Based Term Similarity Measures

Learning Vector Representations for Similarity Measures

Conventional vector-based similarity measures consider each term separately. In methods such as cosine or overlap, only identical terms occurring in both term vectors are matched and contribute to the final similarity score. Non-identical but semantically related terms, such as “car” and “automobile”, are completely ignored. To address this problem, we propose a novel approach that learns a new...

Term Similarity and Weighting Framework for Text Representation

Expressiveness of natural language is a challenge for text representation since the same idea can be expressed in many different ways. Therefore, terms in a document should not be treated independently of one another since together they help to disambiguate and establish meaning. Term-similarity measures are often used to improve representation by capturing semantic relationships between terms....

Publication date: 2009
